EXPLORING RED WINE DATA by Jesus Fandino

Univariate Plots Section

Some observations:

Univariate Analysis

What is the structure of your dataset?

In this work, we used a dataset of red Vinho Verde wines from Portugal that contains 1599 samples. The dataset was first used by Cortez et al. [1] in 2009 and is freely available.

Variables and Units (Unmodified dataset).

  • X: whole number, unique for each observation (Removed).

  • fixed.acidity: non-volatile acids found in wines, the most common are tartaric, malic, citric and succinic (as tartaric acid in g/L).

  • volatile.acidity: largely acetic acid, linked to the vinegary taste (as acetic acid in g/L).

  • citric.acid: found in small quantities, typically 1/20 of the tartaric acid concentrations (in g/L).

  • residual.sugar: the sugar that remains in wine after fermentation completes (in g/L).

  • chlorides: the amount of salt (as sodium chloride in g/L).

  • free.sulfur.dioxide: molecular SO2 + bisulfite (HSO3-) + sulfite (SO3 2-). It is the buffer against microbes and oxidation (in mg/L).

  • total.sulfur.dioxide: free sulfur dioxide + bound sulfur dioxide [sulfites attached to sugars, acetaldehyde or phenolic compounds] (in mg/L).

  • density: mass divided by volume; the density of wine is close to that of water (in g/cm3).

  • pH: -log10 of the activity of the hydrogen ion, it ranges from 0 to 14. (Zero is the most acidic, 7 is neutral, and 14 is the most basic); pH values for wines are typically between 3 and 4.

  • sulphates: The concentration of sulfates. Sulfur dioxide, upon oxidation, is converted into sulfate (as potassium sulfate in g/L)

  • alcohol: the amount of alcohol in % vol. For wines, it varies over a wide range, from 5.5% (Moscato d’Asti) up to 21% (fortified wines).

  • quality: a whole number that rates the wine, from zero (bad) to ten (excellent). It is a subjective measure. In this dataset, it is the median of blind tastings by at least three experts per sample.

## 'data.frame':    1599 obs. of  12 variables:
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
##  fixed.acidity   volatile.acidity  citric.acid    residual.sugar  
##  Min.   : 4.60   Min.   :0.1200   Min.   :0.000   Min.   : 0.900  
##  1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090   1st Qu.: 1.900  
##  Median : 7.90   Median :0.5200   Median :0.260   Median : 2.200  
##  Mean   : 8.32   Mean   :0.5278   Mean   :0.271   Mean   : 2.539  
##  3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420   3rd Qu.: 2.600  
##  Max.   :15.90   Max.   :1.5800   Max.   :1.000   Max.   :15.500  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.01200   Min.   : 1.00       Min.   :  6.00      
##  1st Qu.:0.07000   1st Qu.: 7.00       1st Qu.: 22.00      
##  Median :0.07900   Median :14.00       Median : 38.00      
##  Mean   :0.08747   Mean   :15.87       Mean   : 46.47      
##  3rd Qu.:0.09000   3rd Qu.:21.00       3rd Qu.: 62.00      
##  Max.   :0.61100   Max.   :72.00       Max.   :289.00      
##     density             pH          sulphates         alcohol     
##  Min.   :0.9901   Min.   :2.740   Min.   :0.3300   Min.   : 8.40  
##  1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500   1st Qu.: 9.50  
##  Median :0.9968   Median :3.310   Median :0.6200   Median :10.20  
##  Mean   :0.9967   Mean   :3.311   Mean   :0.6581   Mean   :10.42  
##  3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300   3rd Qu.:11.10  
##  Max.   :1.0037   Max.   :4.010   Max.   :2.0000   Max.   :14.90  
##     quality     
##  Min.   :3.000  
##  1st Qu.:5.000  
##  Median :6.000  
##  Mean   :5.636  
##  3rd Qu.:6.000  
##  Max.   :8.000

What is/are the main feature(s) of interest in your dataset?

The most noticeable variable in this dataset is the quality of the wine. That is a unique feature because it is the only one that does not come directly from an instrument or a mathematical derivation of a direct measurement. So far we depend on human experts to assign a quality number to a wine. For that reason, any way to correlate ‘objective’ variables with quality might contribute to our understanding of what makes a wine bad or good.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

It appears that volatile acidity and alcohol content are the variables with the strongest correlation to quality. We are going to follow them closely, but also we are going to investigate most of the variables in our dataset.

Did you create any new variables from existing variables in the dataset?

Immediately after loading the .csv file, we deleted X (we did not find a proper use for it).

After creating the correlation matrix (that required only numerical variables), we proceeded to convert quality to a factor variable.

At that point, we also created three more ordered categorical variables using existing variables:

  • quality.level, with four levels: “Low” for quality equal to 3 or 4, “Medium Low” for quality equal to 5, “Medium” for quality equal to 6, and “High” for quality equal to 7 or 8.

  • alcohol.level, with three levels: “Low Alcohol” for alcohol content less than or equal to 10 % vol., “Medium Alcohol” for alcohol content greater than 10 % vol. and less than or equal to 11.5 % vol., and “High Alcohol” for alcohol content greater than 11.5 % vol.

  • SO2.level, with three levels: “Low SO2” for values of ‘total.sulfur.dioxide’ more than one standard deviation (std) below the mean, “Medium SO2” for values within one std of the mean, and “High SO2” for values more than one std above the mean.
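The three derived variables can be sketched as simple binning rules. The snippet below is a minimal Python illustration, not the code used for the original analysis; the function names are hypothetical, the ten rows are the first observations printed in the structure listing, and the handling of values exactly one standard deviation from the mean (counted here as “Medium SO2”) is an assumption. On the full dataset the mean and standard deviation would be taken over all 1599 samples.

```python
from statistics import mean, stdev

def quality_level(quality):
    """Map a quality score to the four ordered levels described above."""
    if quality <= 4:
        return "Low"
    if quality == 5:
        return "Medium Low"
    if quality == 6:
        return "Medium"
    return "High"

def alcohol_level(alcohol):
    """Bin alcohol content (% vol) at the 10 and 11.5 cut points."""
    if alcohol <= 10.0:
        return "Low Alcohol"
    if alcohol <= 11.5:
        return "Medium Alcohol"
    return "High Alcohol"

def so2_level(value, center, spread):
    """Bin total sulfur dioxide relative to mean +/- one standard deviation.

    Assumption: values exactly on a boundary count as "Medium SO2".
    """
    if value < center - spread:
        return "Low SO2"
    if value > center + spread:
        return "High SO2"
    return "Medium SO2"

# First ten observations, copied from the structure listing above.
quality = [5, 5, 5, 6, 5, 5, 5, 7, 7, 5]
alcohol = [9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5]
total_so2 = [34, 67, 54, 60, 34, 40, 59, 21, 18, 102]

# Sample mean and sample standard deviation of this ten-row excerpt only.
center = mean(total_so2)
spread = stdev(total_so2)

quality_levels = [quality_level(q) for q in quality]
alcohol_levels = [alcohol_level(a) for a in alcohol]
so2_levels = [so2_level(v, center, spread) for v in total_so2]

for row in zip(quality_levels, alcohol_levels, so2_levels):
    print(row)
```

Keeping each rule in its own function makes the cut points easy to audit against the definitions in the list.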

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

The histogram for the variable ‘citric.acid’ was unexpected. It shows some sharp peaks, which could be caused by the addition of certain preferred amounts of citric acid to the samples (including not adding citric acid at all). Other than creating the new variables mentioned above, we did not modify the existing data in any way.

Bivariate Plots Section

There are some strong correlations in the data:

  1. pH and fixed acidity (Negative). The more acid a wine contains, the lower its pH.

  2. citric acid and fixed acidity (Positive). Citric acid is a fixed acid.

  3. fixed acidity and density (Positive). Fixed organic acids are denser than water.

  4. total sulfur dioxide and free sulfur dioxide (Positive). Free sulfur dioxide is a subset of total sulfur dioxide.

  5. citric acid and volatile acidity (Negative). Surprising, since the citric-sugar co-metabolism can increase the formation of volatile acid in wine.

  6. pH and citric acid (Negative). Citric acid is a fixed acid (same reasoning as in 1).

  7. alcohol and density (Negative). Alcohol is less dense than water.

  8. alcohol and quality (Positive). Subjective.

So far, we can argue (according to our results), that human taste correlates high alcohol content with high-quality wines.
Chlorides and fixed acidity do not show any clear trend with quality level. Meanwhile, total sulfur dioxide seems to peak at quality equal to 5. The highest percentage of outliers belongs to levels 5 and 6.

The decrease in the median density as the alcohol level goes up is expected. More intriguing is why the interquartile range of density seems to widen at higher alcohol contents. Water and alcohol have similar pH values, which probably helps to explain why pH barely changes across alcohol levels. Low-alcohol samples tend to display more outliers than the rest (particularly for sulfates and chlorides). What does that mean?

The increase in the median volatile acidity with SO2 level was unexpected. However, we found that the taste of wine starts to become unpleasant at acetic acid concentrations greater than 1.2 g/L. According to the picture above, only a few samples in our dataset exceed that limit. Below it, winemakers usually adjust the volatile acidity in pursuit of the desired flavor.

Those four graphs display the strongest correlations among our variables. We are going to dig deeper into them, searching for any hidden dependency on quality. In the case of free sulfur dioxide vs total sulfur dioxide we removed the top 0.5 % of the total sulfur dioxide data (two outliers) from the plot.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

On average, as quality goes from 3 up to 8, the concentrations of citric acid, sulfates and alcohol increase. Volatile acidity, density, pH (not shown), and chlorides show the opposite trend; they decrease as quality goes up. Total sulfur dioxide peaks at quality 5 and then decreases. Residual sugar (not shown) did not change appreciably within the quality range.
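The per-quality comparison behind these statements amounts to grouping a variable by quality and summarizing each group. A minimal Python sketch, using only the first ten rows printed in the structure listing (so it illustrates the mechanics, not the trend over all samples), could look like this:

```python
from collections import defaultdict
from statistics import median

# First ten observations, copied from the structure listing above.
quality = [5, 5, 5, 6, 5, 5, 5, 7, 7, 5]
alcohol = [9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5]

# Collect the alcohol values observed at each quality score.
groups = defaultdict(list)
for q, a in zip(quality, alcohol):
    groups[q].append(a)

# Median alcohol content per quality score, in increasing order of quality.
median_alcohol = {q: median(values) for q, values in sorted(groups.items())}
print(median_alcohol)
```

The same loop works for any other column, such as volatile acidity or density.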

Variations in some variables across alcohol levels are, in general, less sharp than in the case of quality. We observed a slight average increase in the concentration of citric acid and sulfates as alcohol levels go up. Meanwhile, density, volatile acidity and chlorides dropped in the same interval of alcohol levels.

The changes were even harder to observe in the case of total sulfur dioxide levels (SO2.levels). In that case, a modest increase in the average value of density was observed as SO2 levels moved from low to high.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

As expected, pH values went down as acidity increased, and density decreased as alcohol content increased. However, volatile acidity tends to increase as SO2 levels go from low to high. That is counterintuitive, since one of the multiple uses of sulfur dioxide is to inhibit the proliferation of the bacteria responsible for acetic acid production.

What was the strongest relationship you found?

We found a total of four relationships that were almost equally strong. The strongest among them was between pH and fixed acidity (Pearson’s correlation coefficient of -0.683, which becomes stronger still if we use log10(fixed.acidity)). Such a high correlation is expected, since both quantities are tied to the chemistry of acid dissociation in acid-base reactions. Fixed acidity also shows strong relationships with citric acid and density, as shown. Total sulfur dioxide and free sulfur dioxide are highly correlated as well.
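For reference, Pearson’s coefficient and the effect of the log10 transform can be sketched as follows. This is a minimal Python illustration on the first ten rows from the structure listing; the helper name `pearson` is hypothetical, and ten rows will not reproduce the value of -0.683 obtained on the full dataset.

```python
from math import log10, sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)

# First ten observations, copied from the structure listing above.
fixed_acidity = [7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5]
ph = [3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35]

# Correlation on the raw scale and after taking log10 of fixed acidity.
r_raw = pearson(fixed_acidity, ph)
r_log = pearson([log10(x) for x in fixed_acidity], ph)

print(f"raw:   {r_raw:.3f}")
print(f"log10: {r_log:.3f}")
```

Taking the logarithm is a natural choice here because pH is itself a logarithmic measure of hydrogen ion activity.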

Multivariate Plots Section

There is a general reduction of the scattering as the quality level rises in the four plots. That could be related to a more restricted change interval for the given variables as quality improves (i.e. maybe a change of 0.5 pH units for a given fixed acidity in some regions still produces a “Medium Low” quality wine but not a “High” quality one). The data points belonging to different quality levels are mixed in a way that with the aid of the variables plotted, we were not able to identify “regions” of given qualities.

The same four combinations of variables, now faceted by alcohol level, produced a result similar to the one previously shown for quality levels (the correlation coefficient increases from low to high alcohol levels). Also, a quick look at the quality composition by alcohol level reveals that “Medium Low” quality samples dominate at “Low Alcohol”, “Medium” quality samples dominate at “Medium Alcohol”, and the “High Alcohol” level is almost exclusively populated by medium and high-quality samples. All of this supports the strong connection between quality and alcohol.

The Density vs. Alcohol plot, colored by quality level, confirms what we already know: high-quality wines tend to have high levels of alcohol and low densities. There is no clear pattern that would allow us to identify “zones” mostly populated by samples of a single quality. When we faceted this plot by quality and changed the colors to match the alcohol level, we immediately obtained a clear view of the sample distribution by alcohol level for each quality level.

At the beginning of this study, we expected the sulfur dioxide content to be a factor in wine quality, due to its anti-oxidative and anti-microbial properties. What we can see by plotting volatile acidity vs. total sulfur dioxide is that the latter can be found over a wide interval of concentrations, and winemakers still managed to obtain medium-quality wines.

The plot of sulphates vs. chlorides, colored by quality level (top), shows a “core” of points where all quality levels are included, with scattered points that could be easily treated as outliers and eventually discarded. When we faceted that graph by alcohol level and quality levels (bottom), a slightly different picture arises. It is clear that all those “outliers” occurred for the “Low Alcohol” condition only. They could be still outliers, or a sign that at low levels of alcohol the variables are allowed to change in a wider range. In any case, a more careful look is required.

Small details also contribute to a better understanding of the data. In this case, we plotted density vs. volatile acidity. From the first graph, it is hard to draw any conclusion other than that the data appear dispersed without a clear trend. When we faceted it by SO2 level, it became apparent that, on average, the density increases with SO2 level (as expected) and volatile acidity unexpectedly rises as the SO2 level goes from low to high (we noted this in the comments on the box plots above). The small detail is that the dispersion of the data along the vertical axis seems to increase from lower to higher SO2 levels. A second faceting of the previous plot, this time by alcohol level, let us dig deeper into the data. For each SO2 level, there is a distribution of alcohol content in the samples. Those distributions have medians far enough apart to cause dispersion in density. The most populated alcohol level (within a single SO2 level) shifts from “Medium Alcohol” for “Low SO2” (86/178) to “Low Alcohol”, with an increasing share, for the next two levels (522/1169, then 178/252).
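The fractions quoted above are easier to compare as shares. The short Python sketch below only redoes that arithmetic; it assumes that the “next two levels” are “Medium SO2” and “High SO2”, in that order.

```python
# (most common alcohol level, its sample count, samples in the SO2 level),
# with the counts taken from the paragraph above.
counts = {
    "Low SO2": ("Medium Alcohol", 86, 178),
    "Medium SO2": ("Low Alcohol", 522, 1169),
    "High SO2": ("Low Alcohol", 178, 252),
}

# Share of the most common alcohol level within each SO2 level.
shares = {level: n / total for level, (_, n, total) in counts.items()}

# The three SO2 levels together should cover the whole dataset.
total_samples = sum(total for _, _, total in counts.values())

for level, (alcohol, n, total) in counts.items():
    print(f"{level}: {alcohol} {n}/{total} = {shares[level]:.1%}")
print("samples:", total_samples)
```

The three group sizes add up to the 1599 observations reported in the structure listing, and the “Low Alcohol” share grows from about 45 % at “Medium SO2” to about 71 % at “High SO2”.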

When we plotted sulphates vs. volatile acidity using color to account for quality levels, we did not observe any trend at first. Faceting that plot by quality level allowed us to realize that medians of sulphates and volatile acidity in the “Low” and “High” quality groups of samples were separated enough to be noticed. They were just overshadowed by a cloud of points belonging to intermediate values of quality.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

When we split the four strongest correlations across quality levels, we observed that they tend to become more robust as quality moves from low to high. The same applies if we replace quality levels with alcohol levels.

Density has a strong negative correlation with alcohol. When we plotted density vs. alcohol in a scatter graph colored by quality level, we saw that a significant part of the “Medium Low” quality samples lies to the left of the median alcohol value, while most of the “High” quality samples lie to the right of it. It is hard to make any statement regarding “Medium” or “Low” samples. When we faceted that plot by quality level and colored it by alcohol level, we could see how the relative composition (in terms of alcohol levels) changes across the quality levels. It goes from rich in low-alcohol samples at low quality to depleted of them at high quality. The higher relative number of high-alcohol samples at high quality helps to explain why the median density decreases from low-quality to high-quality wines.

Volatile acidity and total sulfur dioxide are a pair of variables that are not correlated. By plotting a scatter graph of volatile acidity vs. total sulfur dioxide (removing two outliers), colored by quality level, we noticed a few things. High-quality samples tend to have low values of volatile acidity, low-quality samples show more dispersion in volatile acidity than the rest, and medium-low-quality samples are spread all over the total sulfur dioxide axis. The same plot, faceted by alcohol level, reveals that low-quality samples, highly dispersed along the total sulfur dioxide axis, dominate the low alcohol level. The “Medium Alcohol” level contains proportionally more high-quality samples, pushing the median volatile acidity to lower values. At the “High Alcohol” level, there are almost exclusively medium and high-quality samples, which lowers the median volatile acidity even further. Red wines with “High Alcohol” are characterized by relatively low dispersion and low median values of volatile acidity and total sulfur dioxide. That is a good thing, since volatile acidity is linked to the vinegary taste and high levels of total sulfur dioxide have a bad influence on taste as well.

Were there any interesting or surprising interactions between features?

The sulfates vs. chlorides scatter graph shows dispersion in both axes and giving color to the samples by quality does not help to understand what is going on. When this plot is faceted by quality levels and alcohol levels it becomes clear that the issue occurs exclusively with low alcohol samples.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

We resisted the temptation of creating a new variable named “total acidity” by adding up “fixed.acidity”, “volatile.acidity”, and “citric.acid”. A closer look at the variables makes clear that they are expressed as different acids. Unfortunately, even by doing that correctly with the intention of creating a model to predict the pH values based on total acidity (and using a simple equilibrium constant model), our predicted pH values were off by 0.3 to 0.9 pH units. For that reason, we decided to exclude it from this work.
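To illustrate why the three columns cannot simply be added, the sketch below converts a concentration reported as one acid into tartaric acid equivalents through acid equivalents (acidic protons per molecule divided by molar mass). It is a minimal Python illustration of the unit problem under that convention; the function and table names are hypothetical, and it is not necessarily the conversion we applied, nor a pH model.

```python
# Molar mass in g/mol and number of acidic protons per molecule.
ACIDS = {
    "tartaric": (150.09, 2),
    "acetic": (60.05, 1),
    "citric": (192.12, 3),
}

def as_tartaric(grams_per_litre, acid):
    """Express a concentration of `acid` (g/L) as tartaric acid (g/L).

    Assumption: every acidic proton counts as one titratable equivalent.
    """
    molar_mass, protons = ACIDS[acid]
    equivalents_per_litre = grams_per_litre * protons / molar_mass
    tartaric_mass, tartaric_protons = ACIDS["tartaric"]
    return equivalents_per_litre * tartaric_mass / tartaric_protons

# Example: a volatile acidity of 0.60 g/L, reported as acetic acid.
volatile_as_tartaric = as_tartaric(0.60, "acetic")
print(f"{volatile_as_tartaric:.2f} g/L as tartaric acid")
```

Under this convention, 0.60 g/L of acetic acid corresponds to about 0.75 g/L expressed as tartaric acid, so a plain sum of the raw columns would understate the volatile contribution.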

Final Plots and Summary

Plot One

Description One

The bar plot displays the composition of our dataset in terms of quality and alcohol levels. It shows not only that qualities 5 and 6 are the most populated, but also that the number of low-alcohol samples peaks at quality 5, medium-alcohol samples peak at 6, and high-alcohol samples peak at 7. The relative composition of alcohol levels changes from one quality value to the next. At low quality values (less than or equal to 5), samples with low alcohol dominate; at quality 6, “Medium Alcohol” samples are the most common; and at 7 the composition shifts towards “High Alcohol.”
We chose the “quality.level” variable in such a way that it is consistent with this graph. We merged samples from qualities 3 and 4 to create the level “Low,” and those from qualities 7 and 8 to create “High.” We decided to leave 5 and 6 as independent levels (“Medium Low” and “Medium,” respectively) mainly because they have different alcohol level compositions.
According to this graph, high quality is associated with a high concentration of alcohol in wines.
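The counts behind a bar plot of this kind come from a cross-tabulation of quality against alcohol level. A minimal Python sketch of that tabulation, using the first ten rows printed in the structure listing and a hypothetical `alcohol_level` helper that follows the cut points defined earlier, could be:

```python
from collections import Counter

def alcohol_level(alcohol):
    """Bin alcohol content (% vol) at the 10 and 11.5 cut points."""
    if alcohol <= 10.0:
        return "Low Alcohol"
    if alcohol <= 11.5:
        return "Medium Alcohol"
    return "High Alcohol"

# First ten observations, copied from the structure listing.
quality = [5, 5, 5, 6, 5, 5, 5, 7, 7, 5]
alcohol = [9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5]

# Number of samples for each (quality, alcohol level) combination.
composition = Counter(
    (q, alcohol_level(a)) for q, a in zip(quality, alcohol)
)

for (q, level), n in sorted(composition.items()):
    print(q, level, n)
```

With all samples, each quality value would contribute one stacked or grouped bar per alcohol level; the ten-row excerpt is too small to show the shift described above.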

Plot Two

Description Two

Although alcohol seems to be relevant to wine quality, other variables also change across the whole quality range. In the figure above we chose four of them: citric acid, volatile acidity, sulfates and density. They all have monotonic trends with quality: citric acid and sulfates increase, while volatile acidity and density decrease, as quality goes from 3 to 8. Citric acid is a weak organic acid; its addition helps by chelating metal ions to prevent browning, increasing the acidity in the process. Too much citric acid affects the taste of red wines negatively and causes microbial instabilities, since bacteria use citric acid in their metabolism.
Volatile acidity is mostly due to the presence of acetic acid in wines. It is responsible for the vinegary odor and taste. For that reason, it is no surprise that volatile acidity decreases as quality increases.
Wine is water, for the most part. That explains why its density is close to 1 g/cm3. The presence of alcohol lowers the density of wine. As the alcohol % vol increases with quality, we expect a decrease in wine density as quality rises. In a simple water-ethanol solution, a 10 % vol alcohol concentration (which is typical for wines) would lead to a density of roughly 0.98 g/cm3. What we observe in our data are higher values of density, so other components of wine also play a role. The final result is a narrow density interval.
One of the known sulfates (or sulphates in British English) used in wines is copper sulfate. It is a fining agent, used to remove unpleasant aromas in wine, particularly those related to hydrogen sulfide (rotten-egg-like off aromas). That explains, at least in part, the positive correlation between sulphates and quality.
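The water-ethanol estimate can be checked with a back-of-the-envelope calculation. The Python sketch below assumes ideal mixing, that is, volumes that simply add up, with densities of about 0.789 g/cm3 for ethanol and 0.998 g/cm3 for water near 20 °C. Real mixtures contract slightly, so the true density is somewhat higher than this estimate.

```python
ETHANOL_DENSITY = 0.789  # g/cm3, pure ethanol near 20 °C
WATER_DENSITY = 0.998    # g/cm3, water near 20 °C

def ideal_mixture_density(alcohol_vol_percent):
    """Density of a water-ethanol mixture, assuming volumes are additive."""
    fraction = alcohol_vol_percent / 100
    return fraction * ETHANOL_DENSITY + (1 - fraction) * WATER_DENSITY

estimate = ideal_mixture_density(10.0)
print(f"ideal 10 % vol mixture: {estimate:.3f} g/cm3")

# Observed densities in the dataset, from the summary table.
observed_min, observed_mean = 0.9901, 0.9967
print("below the lowest observed density:", estimate < observed_min)
```

Even the least dense sample in the dataset (0.9901 g/cm3) lies above this estimate of about 0.977 g/cm3, which is consistent with dissolved solids such as sugars and acids raising the density.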

Plot Three

Description Three

After plotting some data, it seems that the values for samples with qualities 5 and 6 are dispersed throughout the entire range. That makes it nearly impossible to find clean areas in a two-variable space (not involving a quality variable) with clear boundaries between quality levels. It appears that, after centuries of wine production, winemakers in the Minho province (Portugal) have found a safe range of values for the variables under consideration that allows them to produce a significant percentage of medium-quality (5-6) red wines. We then decided to explore what happens at just the extreme levels (“Low” and “High” quality). In the figure above we eliminated the data points from quality levels 5 and 6. We plotted sulphates vs. volatile acidity (top), and then faceted by alcohol level (bottom).

We observed some regularities:

  • The data points are segregated by quality; low-quality points tend to have low values of sulfates and high values of volatile acidity. The opposite occurs for high-quality data with little overlap.

  • The dispersion decreases as the alcohol level increases, consistent with some trends in the general data.

  • The relative number of samples also changed, from a low-quality majority at “Low Alcohol” to almost all high-quality at “High Alcohol.”

Reflection

Assigning a quality number to a wine relies entirely on human taste, with all its subjectivity. Throughout this study, we tried to identify variables and conditions that could help us rationalize what otherwise remains more an art than a science. We found that, on average, alcohol and sulfates have a positive influence on wine quality, while volatile acidity has an adverse impact on it. We also noticed, after some searching, that there is at least one configuration of three variables that could potentially improve the overall quality of the batches by minimizing low-quality wines. In the future, it would be interesting to continue the exploration in that direction. We acknowledge that the small amount of data could be a serious drawback.

References

[1] P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis, “Modeling wine preferences by data mining from physicochemical properties,” Decision Support Systems, 47(4), 2009, 547-553, ISSN: 0167-9236.